Speeding up R

Stuart Lacy

2023-06-28

Introduction

  • Focus on techniques for speeding up analysis of tabular data
  • Subjects:
    • Vectorization
    • Joins
    • Rcpp
    • data.table
    • Alternative backends

Vectorization

Vectorization Concept

For loops in R are slow (according to many StackOverflow posts)

  • In general, if you’re using a for loop (or a sapply variant), then your code could be sped up by using a vectorised function
  • Definition: f(x)[i] = f(x[i]) for i = 1, ..., N
  • sqrt(c(4, 9, 16)) = c(2, 3, 4), so sqrt is vectorised
  • Using vectorised functions often results in cleaner code with less chance for bugs
  • There are a lot of vectorised functions available in the standard library
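The definition above can be checked directly in base R; the loop and the vectorised call give identical results:

```r
# Applying sqrt element-by-element in a loop...
x <- c(4, 9, 16)
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- sqrt(x[i])
}

# ...gives the same answer as one vectorised call: f(x)[i] == f(x[i])
vectorised <- sqrt(x)
identical(looped, vectorised)
# [1] TRUE
```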

Standard library vectorised functions

  • Non-vectorised
for (i in 1:nrow(df)) {
  # New column based on 2 others
  if (df$x[i] > 5 && df$y[i] < 0) {
    df$y[i] <- 5
  } else {
    df$y[i] <- 0
  }
  
  # Replace NAs with error code
  if (is.na(df$z[i])) {
    df$z[i] <- 9999 
  }
  
  # String concatenate columns
  df$xy[i] <- paste(df$x[i], df$y[i], sep="_")
}
  
# Distance between every row
dists <- matrix(nrow=nrow(df), ncol=nrow(df))
for (i in 1:nrow(df)) {
  for (j in 1:nrow(df)) {
    if (j != i) {
      dists[i, j] <- sqrt((df$x[i] - df$x[j])**2 + (df$y[i] - df$y[j])**2)
    }
  }
}
  • Vectorised

# New column based on 2 others
df$y <- ifelse(df$x > 5 & df$y < 0, 5, 0)





# Replace NAs with error code
df$z[is.na(df$z)] <- 9999



# Concatenate columns
df$xy <- paste(df$x, df$y, sep="_")


# Distance between every row
dist(df[, c('x', 'y')])

Worked example

Given a path some/path/abc/001.txt, create a fast function to return abc_001.txt

  • First attempt works on a single path at a time, separating it by / and concatenating the last directory and filename
  • Doesn't work for a vector input; is there an easy way to vectorise it?
example_1 <- function(path) {
  path_list <- str_split(path, "/") %>% unlist()
  paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}
example_1(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "lar_002.txt"

Version 1 - Vectorize

  • Vectorize takes a function that works on a single element, and returns a vectorised version - job done!
  • However, it just uses mapply under the hood and isn't any quicker; it's mostly syntactic sugar
example_1_vectorised <- Vectorize(example_1)  # This returns a *function*
example_1_vectorised(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
foo/bar/car/001.txt har/far/lar/002.txt 
      "car_001.txt"       "lar_002.txt" 

Version 2

  • Want to replace this implicit for loop with inbuilt vectorised functions
  • str_split is vectorised, returning a list over the input entries
  • paste is also vectorised
  • ❌ Need to use a for loop (sapply) to grab the last dir and filename from each entry
  • Overall we have reduced the computation done inside the for loop
example_2 <- function(paths) {
  path_list <- str_split(paths, "/")
  last_two <- sapply(path_list, tail, 2)
  paste(last_two[1, ], last_two[2, ], sep="_")
}
example_2(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "car_001.txt" "lar_002.txt"

Version 3

  • We can't directly replace this for loop with a single vectorised function, so we have to take another approach
  • dirname('foo/bar/dog.txt') = foo/bar
  • basename('foo/bar/dog.txt') = dog.txt
  • Combining these can give us our entire functionality in 4 inbuilt vectorised function calls!
example_3 <- function(paths) {
  paste(basename(dirname(paths)), basename(paths), sep="_")
}
example_3(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "car_001.txt" "lar_002.txt"

Comparison

  • The microbenchmark library makes it easy to time snippets of code
  • The Vectorize version isn’t doing anything different from manually looping through with sapply
library(microbenchmark)
# Construct 100 paths
paths <- rep(c("some/path/abc/001.txt", "another/directory/xyz/002.txt"), 100)

res <- microbenchmark(
  example_1_vectorised(paths),
  sapply(paths, example_1),
  example_2(paths),
  example_3(paths)
)

summary(res)[c("expr", "median")]
                         expr   median
1 example_1_vectorised(paths) 7868.738
2    sapply(paths, example_1) 7896.205
3            example_2(paths) 1416.520
4            example_3(paths)  139.350

Conclusions

In general, if you’re using a for loop (or a sapply variant), then your code could be sped up by using a vectorised function - Me (7 slides ago)

  • This wasn't fully correct, as vectorised functions can still have for loops under the hood and will thus still be slow
  • The difference between a function vectorised with Vectorize and an inbuilt one like basename is that both contain a for loop, but the latter's loop is written in C/C++ rather than R

In general, if you’re using a for loop (or a sapply variant), then your code could be sped up by using a for loop written in C/C++, preferably part of the standard library - Me (now)

  • Later on will demonstrate how to write our own C++ functions

DataFrames & Joining

Basic DataFrame operations

  • Fortunately working with data.frames and the tidyverse core verbs pushes you towards using vectorised functions
  • group_by() |> summarise() is both quicker and more legible than manually looping over the groups and combining the results
  • filter(f(x)) assumes that f() is vectorised and returns a boolean TRUE/FALSE for every row
  • mutate(newcol=f(oldcol)) assumes f() is vectorised and returns a value per row
  • Caution: you can run into errors or unexpected behaviour when using non-vectorised functions
# The non-vectorised version didn't error,
# but gave an unexpected result
data.frame(path=paths) |>
  mutate(path_clean1 = example_1(path),
         path_clean2 = example_3(path)) |>
  head()
                           path path_clean1 path_clean2
1         some/path/abc/001.txt xyz_002.txt abc_001.txt
2 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
3         some/path/abc/001.txt xyz_002.txt abc_001.txt
4 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
5         some/path/abc/001.txt xyz_002.txt abc_001.txt
6 another/directory/xyz/002.txt xyz_002.txt xyz_002.txt
# This function isn't vectorised due to the if/else statement
# Solution: Use ifelse() instead
replace_both_NA_9999 <- function(x, y) {
  if (is.na(x) && is.na(y)) {
    return(9999)
  } else {
    return(0)
  }
}

data.frame(a=c(5, 3, NA, 2, NA), 
           b=c(NA, 2, NA, 1, 9)) |>
  mutate(c = replace_both_NA_9999(a, b))
Error in `mutate()`:
ℹ In argument: `c = replace_both_NA_9999(a, b)`.
Caused by error in `is.na(x) && is.na(y)`:
! 'length = 5' in coercion to 'logical(1)'
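As the slide notes, the fix is ifelse() together with the element-wise & (rather than &&, which demands a single logical); a minimal base-R sketch of the corrected function:

```r
# Vectorised rewrite: ifelse() checks the condition element-wise and
# returns one value per element, so mutate() can use it directly
replace_both_NA_9999 <- function(x, y) {
  ifelse(is.na(x) & is.na(y), 9999, 0)
}

a <- c(5, 3, NA, 2, NA)
b <- c(NA, 2, NA, 1, 9)
replace_both_NA_9999(a, b)
# [1]    0    0 9999    0    0
```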

Joining

  • Linking 2 datasets together using the join family of functions is an integral part of data analysis
  • However, joins are highly efficient and can be useful in a number of situations, even when we don't have 2 separate datasets
  • inner_join links two dataframes together based on a column in common, with the number of rows equal to the number of rows in the ‘left’ table that have a matching row in the ‘right’ table
df_1 (3 rows):
  group value1
  a          1
  b          2
  c          3
df_2 (3 rows):
  group value2
  b          4
  c          5
  d          6
joined <- df_1 |> inner_join(df_2, by="group")
joined (2 rows):
  group value1 value2
  b          2      4
  c          3      5

Example usage: inner join instead of ifelse

  • Can think of inner_join as being able to both filter and mutate new columns
  • Example: apply different per-group scaling factor to 300,000 measurements from 3 groups
  • With a single joining column the join isn't much quicker, but it's far more legible, and it scales well both to more groups in the joining column and to additional joining columns
df (first 5 of 300,000 rows):
group time value
a 2020-03-05 00:00:00 0.7225352
a 2020-03-05 00:01:00 0.9641624
a 2020-03-05 00:02:00 0.3115964
a 2020-03-05 00:03:00 -0.0363529
a 2020-03-05 00:04:00 -0.6124932
scales (3 rows):
group scale
a 2.0
b 7.8
c 9.0
joined (first 5 of 300,000 rows):
group time value scale
a 2020-03-05 00:00:00 0.7225352 2
a 2020-03-05 00:01:00 0.9641624 2
a 2020-03-05 00:02:00 0.3115964 2
a 2020-03-05 00:03:00 -0.0363529 2
a 2020-03-05 00:04:00 -0.6124932 2
f_join <- function() {
  df |> inner_join(scales, by="group")
}

f_ifelse <- function() {
  df |>
    mutate(scale = ifelse(group == 'a', 2,
                          ifelse(group == 'b', 7.8,
                                 ifelse(group == 'c', 9, NA))))
  
}

res <- microbenchmark(f_join(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
        expr   median
1   f_join() 22.45482
2 f_ifelse() 27.31786
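A third base-R option worth knowing: indexing a named vector is itself a vectorised lookup and behaves like a tiny join (the values here mirror the scales table above):

```r
# The per-group scaling factors, stored as a named vector
scales_vec <- c(a = 2.0, b = 7.8, c = 9.0)

# Indexing by a character vector looks up every element's scale
# in one vectorised operation
grp <- c("a", "c", "b", "a")
unname(scales_vec[grp])
# [1] 2.0 9.0 7.8 2.0
```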

left_join

  • A left_join returns all rows from the left table, together with the columns from the right table wherever there is a match
  • Any row in the left table without a match in the right table has the right table's columns filled with NA
df1
  group val1
1     a    1
2     b    2
3     c    3
4     d    4
df2
  group val2
1     a    1
2     b    4
3     c    9
df1 |> 
  left_join(df2, by="group")
  group val1 val2
1     a    1    1
2     b    2    4
3     c    3    9
4     d    4   NA
df1 |> 
  inner_join(df2, by="group")
  group val1 val2
1     a    1    1
2     b    2    4
3     c    3    9

Example usage: filling gaps with left_join

  • Very useful if you want to be aware of missing values
  • Useful for filling gaps in non-uniformly sampled time series so you can count missingness or interpolate
df
        date measurement
1 2020-01-01   0.9338962
2 2020-01-03   2.8015174
3 2020-01-05   0.5538004
all_times
        date
1 2020-01-01
2 2020-01-02
3 2020-01-03
4 2020-01-04
5 2020-01-05
all_times |> left_join(df, by="date")
        date measurement
1 2020-01-01   0.9338962
2 2020-01-02          NA
3 2020-01-03   2.8015174
4 2020-01-04          NA
5 2020-01-05   0.5538004
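The same gap-filling can be done in base R with merge(all.x = TRUE), and once the gaps are explicit rows, counting missingness is a one-liner (a sketch reusing the example data, with measurements rounded for brevity):

```r
# Example data mirroring the slide
df_gap <- data.frame(date = c("2020-01-01", "2020-01-03", "2020-01-05"),
                     measurement = c(0.93, 2.80, 0.55))
all_times <- data.frame(date = c("2020-01-01", "2020-01-02", "2020-01-03",
                                 "2020-01-04", "2020-01-05"))

# merge() with all.x = TRUE keeps every row of all_times and fills the
# unmatched measurements with NA: a base-R left join
filled <- merge(all_times, df_gap, by = "date", all.x = TRUE)

# With the gaps now explicit rows, missingness is a simple count
sum(is.na(filled$measurement))
# [1] 2
```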

Interval joins

  • Joins aren’t limited to joining on equal values, can also join on intervals or closest value
  • Example: Have measurements from every day in 2020, but want to limit analysis to 5 specific weeks
df_interval (first 10 of 366 rows):
time measurement
2020-01-01 2.9016961
2020-01-02 0.0465646
2020-01-03 -1.5239673
2020-01-04 0.2516128
2020-01-05 0.1280887
2020-01-06 -2.8242639
2020-01-07 -1.4247783
2020-01-08 1.4217469
2020-01-09 -1.5889684
2020-01-10 -0.7023941
weeks (5 rows):
week_group week_start week_end
a 2020-02-14 2020-02-21
b 2020-03-17 2020-03-24
c 2020-05-08 2020-05-15
d 2020-09-20 2020-09-27
e 2020-11-13 2020-11-20
joined <- df_interval |>
  inner_join(weeks, 
             by=join_by(time >= week_start, time < week_end))
joined (first 10 of 35 rows):
time measurement week_group week_start week_end
2020-02-14 -0.2035220 a 2020-02-14 2020-02-21
2020-02-15 -0.2149556 a 2020-02-14 2020-02-21
2020-02-16 0.3451100 a 2020-02-14 2020-02-21
2020-02-17 -0.4586642 a 2020-02-14 2020-02-21
2020-02-18 -0.6964497 a 2020-02-14 2020-02-21
2020-02-19 2.4552832 a 2020-02-14 2020-02-21
2020-02-20 0.7747280 a 2020-02-14 2020-02-21
2020-03-17 -0.6530212 b 2020-03-17 2020-03-24
2020-03-18 -1.3282548 b 2020-03-17 2020-03-24
2020-03-19 0.7340179 b 2020-03-17 2020-03-24

Benchmark

  • On only 366 rows with 5 groups it is already 10x as fast; it will scale better and is more understandable
f_intervaljoin <- function() {
  df_interval |>
    inner_join(weeks, by=join_by(time >= week_start, time < week_end))
}

f_ifelse <- function() {
  df_interval |>
    mutate(week_group = ifelse(time >= as_date("2020-02-14") & time < as_date("2020-02-21"),
                               'a',
                               ifelse(time >= as_date("2020-03-17") & time < as_date("2020-03-24"),
                                      'b',
                                      ifelse(time >= as_date("2020-05-08") & time < as_date("2020-05-15"),
                                             'c',
                                             ifelse(time >= as_date("2020-09-20") & time < as_date("2020-09-27"),
                                                    'd',
                                                    ifelse(time >= as_date("2020-11-13") & time < as_date("2020-11-20"),
                                                           'e', 
                                                           NA)))))) |>
    filter(!is.na(week_group))
}

res <- microbenchmark(f_intervaljoin(), f_ifelse(), times=10)
summary(res)[c("expr", "median")]
              expr    median
1 f_intervaljoin()  2.254611
2       f_ifelse() 27.822321
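For intuition, a non-equi lookup like this can be sketched in base R with findInterval(), which binary-searches the interval starts instead of testing every (row, interval) pair. This sketch assumes sorted, non-overlapping intervals and uses made-up integer "times":

```r
# Sorted, non-overlapping [start, end) intervals labelled a, b, c
starts <- c(10, 30, 50)
ends   <- c(20, 40, 60)
labels <- c("a", "b", "c")

times <- c(5, 12, 25, 35, 55, 60)

# findInterval() returns, for each time, the index of the last start <= time
# (0 if the time is before every start)
idx <- findInterval(times, starts)

# A match only counts if the time also falls before that interval's end
hit <- idx > 0 & times < ends[pmax(idx, 1)]
matched <- ifelse(hit, labels[pmax(idx, 1)], NA_character_)
matched
# [1] NA  "a" NA  "b" "c" NA
```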

Different backends

Example dataset

  • What if we're using fast functions but still experiencing slow performance due to the dataset's size?
  • Example dataset: Companies House data containing 5 million rows (440MB archive download, extracting to 2.4GB) of all companies incorporated in the UK since 1856
  • Using the first million rows as an example
df <- read_csv("BasicCompanyDataAsOneFile-2023-05-01.csv", n_max=1e6, show_col_types=FALSE)
df$IncorporationDate <- as_date(df$IncorporationDate, format="%d/%m/%Y")
dim(df)
[1] 1000000      55
df |> 
  select(CompanyName, RegAddress.PostTown, IncorporationDate, SICCode.SicText_1) |>
  head()
# A tibble: 6 × 4
  CompanyName            RegAddress.PostTown IncorporationDate SICCode.SicText_1
  <chr>                  <chr>               <date>            <chr>            
1 ! HEAL UR TECH LTD     GUILDFORD           2022-10-12        33140 - Repair o…
2 ! LTD                  LEEDS               2012-09-11        99999 - Dormant …
3 !? LTD                 ROMILEY             2018-06-05        47710 - Retail s…
4 !BIG IMPACT GRAPHICS … LONDON              2018-12-28        18129 - Printing…
5 !GOBERUB LTD           BISHOP'S STORTFORD  2021-05-17        62020 - Informat…
6 !NFOGENIE LTD          LONDON              2021-07-21        58290 - Other so…

Question 1: How many companies have the same name?

  • Will use several basic research questions to have some ‘real-world’ analysis code to benchmark
  • How many companies have the same name?
df |> 
  count(CompanyName) |> 
  filter(n > 1) |>
  nrow()
[1] 773

Question 2: Which York postcode has the most businesses?

  • Want to find the 5 postcodes with the most businesses created in York
  • Need to do some string manipulation to extract the first part of the YOXX YYY postcode format
df |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = word(RegAddress.PostCode, 1, sep=" ")) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5)
# A tibble: 5 × 2
  postcode     n
  <chr>    <int>
1 YO30       486
2 YO19       325
3 YO26       271
4 YO1        241
5 YO31       179

Question 3: Classifications

  • Companies can be assigned up to 4 classifications from a list of 1,042 options
  • Do classifications tend to cluster together? I.e. is the average number of classifications a company has related to the first classification?
  • Slightly tenuous example but wanted to demonstrate pivoting + joining!
  • Only want to look at classifications that are used by at least 10 companies (inner_join to filter)
  • Multiple classifications are stored in 4 wide columns that are NA when unused; it's easier to count the number of non-NA entries in long format
# A tibble: 6 × 5
  CompanyName              SICCode.SicText_1 SICCode.SicText_2 SICCode.SicText_3
  <chr>                    <chr>             <chr>             <chr>            
1 ! HEAL UR TECH LTD       33140 - Repair o… 47421 - Retail s… <NA>             
2 ! LTD                    99999 - Dormant … <NA>              <NA>             
3 !? LTD                   47710 - Retail s… <NA>              <NA>             
4 !BIG IMPACT GRAPHICS LI… 18129 - Printing… 59112 - Video pr… 63120 - Web port…
5 !GOBERUB LTD             62020 - Informat… 70229 - Manageme… 79110 - Travel a…
6 !NFOGENIE LTD            58290 - Other so… <NA>              <NA>             
# ℹ 1 more variable: SICCode.SicText_4 <chr>

Question 3: Classifications (code)

# 755 rows containing the SIC codes that at least 10 companies have
# Only 1 column, SICCode.SicText_1
sic_10_companies <- df |> 
                count(SICCode.SicText_1) |>
                filter(n >= 10) |>
                select(SICCode.SicText_1)

df |>
  # Could filter to restrict to these SIC codes, but it's actually quicker to use an inner join
  inner_join(sic_10_companies, by="SICCode.SicText_1") |>
  select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
  mutate(first_classification = SICCode.SicText_1) |>
  # Pivoting to make it easier to count how many non-NULL classifications each company has
  pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
  filter(!is.na(value)) |>
  # Count how many classifications each company has
  count(CompanyNumber, first_classification) |>
  # Calculate the average number per the first classification
  group_by(first_classification) |>
  summarise(mean_classifications = mean(n, na.rm=T)) |>
  arrange(desc(mean_classifications)) |>
  head()
# A tibble: 6 × 2
  first_classification                                      mean_classifications
  <chr>                                                                    <dbl>
1 7210 - Hardware consultancy                                               3   
2 07100 - Mining of iron ores                                               2.48
3 10611 - Grain milling                                                     2.43
4 18110 - Printing of newspapers                                            2.43
5 14131 - Manufacture of other men's outerwear                              2.40
6 01280 - Growing of spices, aromatic, drug and pharmaceut…                 2.39

data.table

Introduction

  • data.table is an alternative to data.frame/tibble that is optimised for speed and low memory usage
  • The trade-off is that its API is a bit/lot less user friendly
library(data.table)
dt <- fread("BasicCompanyDataAsOneFile-2023-05-01.csv", nrows=1e6)         # fread is the equivalent of read.csv
dt[, IncorporationDate := as_date(IncorporationDate, format="%d/%m/%Y") ]  # Creates a new column by *reference*
dim(dt)
[1] 1000000      55
# Display rows 1-5 and the specified columns
dt[1:5, .(CompanyName, RegAddress.PostTown, IncorporationDate, SICCode.SicText_1)]
                    CompanyName RegAddress.PostTown IncorporationDate
1:           ! HEAL UR TECH LTD           GUILDFORD        2022-10-12
2:                        ! LTD               LEEDS        2012-09-11
3:                       !? LTD             ROMILEY        2018-06-05
4: !BIG IMPACT GRAPHICS LIMITED              LONDON        2018-12-28
5:                 !GOBERUB LTD  BISHOP'S STORTFORD        2021-05-17
                                       SICCode.SicText_1
1:                33140 - Repair of electrical equipment
2:                               99999 - Dormant Company
3: 47710 - Retail sale of clothing in specialised stores
4:                               18129 - Printing n.e.c.
5: 62020 - Information technology consultancy activities

Counting number of companies with the same name

  • Generally, dt[i, j, by] means: for data table dt, filter rows with i, create and/or select columns with j, and group by by
  • data.table operations don't use the pipe (|> or %>%), so you either chain together [] or create intermediate variables
  • data.tables inherit from data.frame, so standard functions work on them; they just won't benefit from the speed-up
  • .N is the equivalent of count
nrow( dt[ , .N, by=.(CompanyName) ][ N > 1 ] )
[1] 773

York postcodes with the most businesses

  • In this example it's easier to create an intermediate variable than to use a one-liner
  • .SD is the Subset of Data for each group (all columns by default), to which an operation can be applied
postcodes <- dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode]
postcodes[order(-postcodes$N), head(.SD, 5)]
   postcode   N
1:     YO30 486
2:     YO19 325
3:     YO26 271
4:      YO1 241
5:     YO31 179
# Alternative one-liner
#setorder(dt[ RegAddress.PostTown == 'YORK', .(postcode = word(RegAddress.PostCode, 1))][, .N, by=postcode], -N)[, head(.SD, 5)]

Number of classifications

  • Joins are less intuitive: x[y] is equivalent to left_join(y, x), NOT inner_join(x, y)
  • melt is the equivalent of pivot_longer and IMO less intuitive
  • Intermediate variables everywhere!
sic_10_companies_dt <- dt[, .N, by=.(SICCode.SicText_1)][ N >= 10, .(SICCode.SicText_1) ]
dt_companies_wide <- dt[ sic_10_companies_dt,  # This is a join!
                         .(CompanyNumber, 
                           first_classification = SICCode.SicText_1,
                           SICCode.SicText_1,
                           SICCode.SicText_2,
                           SICCode.SicText_3,
                           SICCode.SicText_4),
                          on=.(SICCode.SicText_1)]
dt_companies_long <- melt(dt_companies_wide, id.vars=c('CompanyNumber', 'first_classification'))
dt_companies_mean <- dt_companies_long[ value != '',  # Removes the unused SIC entries
                                        .N, 
                                        by=.(CompanyNumber, first_classification)][, 
                                                                                   .(mean_classifications = mean(N, na.rm=T)), 
                                                                                   by=.(first_classification)]
head(dt_companies_mean[ order(mean_classifications, decreasing = TRUE)])
                                                 first_classification
1:                                        7210 - Hardware consultancy
2:                                        07100 - Mining of iron ores
3:                                     18110 - Printing of newspapers
4:                                              10611 - Grain milling
5:                       14131 - Manufacture of other men's outerwear
6: 01280 - Growing of spices, aromatic, drug and pharmaceutical crops
   mean_classifications
1:             3.000000
2:             2.478261
3:             2.428571
4:             2.428571
5:             2.401606
6:             2.392157

Speed comparison with tidyverse

tidytable and dtplyr

tidytable: introduction

  • tidytable is a drop-in replacement for common tidyverse functions that works on a data.table object under the hood
  • So (in theory!) you get the speed of data.table with the user-friendly API of the tidyverse
  • Just load the library and all subsequent calls to mutate, inner_join, count, select, filter etc. will use the tidytable versions that work on a data.table
  • Beware: not all functions have been ported over, and it explicitly overwrites the dplyr, tidyr, and purrr functions
  • There's a lag before changes to the tidyverse are reflected in tidytable
library(tidytable)
# Here we explicitly create tidytable from a regular data.frame
# But passing a regular data.frame or data.table into any tidytable function
# will implicitly change it to be a tidytable object
dtt <- as_tidytable(df)

dtt |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 
# A tidytable: 755 × 1
   SICCode.SicText_1                                                       
   <chr>                                                                   
 1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
 2 01120 - Growing of rice                                                 
 3 01130 - Growing of vegetables and melons, roots and tubers              
 4 01160 - Growing of fibre crops                                          
 5 01190 - Growing of other non-perennial crops                            
 6 01210 - Growing of grapes                                               
 7 01220 - Growing of tropical and subtropical fruits                      
 8 01240 - Growing of pome fruits and stone fruits                         
 9 01250 - Growing of other tree and bush fruits and nuts                  
10 01270 - Growing of beverage crops                                       
# ℹ 745 more rows

dtplyr: introduction

  • An alternative data.table wrapper is dtplyr (developed by RStudio team)
  • Works differently to tidytable: it sequentially builds up the equivalent data.table query, but only executes it when you explicitly request the result (using collect(), as.data.table(), or as.data.frame())
  • Loading the package doesn’t affect your environment
  • Has less coverage of the tidyverse than tidytable
library(dtplyr)

# dtplyr operates on `lazy data.tables` which are only created by this function
dtp <- lazy_dt(df)

dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 
Source: local data table [755 x 1]
Call:   `_DT1`[, .(n = .N), keyby = .(SICCode.SicText_1)][n >= 10, .(SICCode.SicText_1)]

  SICCode.SicText_1                                                       
  <chr>                                                                   
1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
2 01120 - Growing of rice                                                 
3 01130 - Growing of vegetables and melons, roots and tubers              
4 01160 - Growing of fibre crops                                          
5 01190 - Growing of other non-perennial crops                            
6 01210 - Growing of grapes                                               
# ℹ 749 more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results

dtplyr: usage

  • Can view the generated data.table query (subtly different from the one I wrote manually)
dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) |>
    show_query()
`_DT1`[, .(n = .N), keyby = .(SICCode.SicText_1)][n >= 10, .(SICCode.SicText_1)]
  • Run collect() to execute it and return a tibble
dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) |>
    collect() |> 
    head()
# A tibble: 6 × 1
  SICCode.SicText_1                                                       
  <chr>                                                                   
1 01110 - Growing of cereals (except rice), leguminous crops and oil seeds
2 01120 - Growing of rice                                                 
3 01130 - Growing of vegetables and melons, roots and tubers              
4 01160 - Growing of fibre crops                                          
5 01190 - Growing of other non-perennial crops                            
6 01210 - Growing of grapes                                               

dtplyr: chaining queries

  • dtplyr queries that haven't been collect()ed can be used in joins
# NB: this returns a datatable QUERY, not a dataset itself
sic_10_companies_dtp <- dtp |> 
    count(SICCode.SicText_1) |>
    filter(n >= 10) |>
    select(SICCode.SicText_1) 

# Can join that query into the middle of another query to return another query
results_dtp <- dtp |>
  inner_join(sic_10_companies_dtp, by="SICCode.SicText_1") |>
  select(CompanyNumber, SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4) |> 
  mutate(first_classification = SICCode.SicText_1) |>
  pivot_longer(c(SICCode.SicText_1, SICCode.SicText_2, SICCode.SicText_3, SICCode.SicText_4)) |>
  filter(!is.na(value)) |>
  count(CompanyNumber, first_classification) |>
  group_by(first_classification) |>
  summarise(mean_classifications = mean(n, na.rm=T)) |>
  arrange(desc(mean_classifications))

# Finally execute the full query
results_dtp |>
  collect() |>
  head()
# A tibble: 6 × 2
  first_classification                                      mean_classifications
  <chr>                                                                    <dbl>
1 7210 - Hardware consultancy                                               3   
2 07100 - Mining of iron ores                                               2.48
3 10611 - Grain milling                                                     2.43
4 18110 - Printing of newspapers                                            2.43
5 14131 - Manufacture of other men's outerwear                              2.40
6 01280 - Growing of spices, aromatic, drug and pharmaceut…                 2.39

Benchmark

Embedded databases

Introduction

  • All of these options require reading the full dataset into memory, which isn't viable for larger-than-memory data
  • Embedded relational databases are stored on disk and only read into memory as needed
  • Will look at 2 variants:
    • SQLite (designed for ‘traditional’ DB applications)
    • duckdb (optimised for analysis)
  • They use SQL (Structured Query Language, its own programming language) to interact with the data, but fortunately in R we can keep using tidyverse functions, just like with dtplyr, rather than learn a new language

Interfacing with SQLite in R

  • Connect to the DB using dbConnect() from library(DBI)
  • DBs are organised into tables (can think of a table as a CSV file)
  • dbWriteTable() will write a dataframe to the DB
library(DBI)  # General database library
library(RSQLite)
# The first argument is the database driver, the second the database file
# If data.sql doesn't exist, it will be created
con_sql <- dbConnect(SQLite(), "data.sql")
dbWriteTable(con_sql, "data_1e6", df)

SQLite usage

  • Can view the SQL query with code identical to dtplyr, except the source data comes from tbl()
# tbl requires a connection and a table name to read data from
tbl(con_sql, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = word(RegAddress.PostCode, 1)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5) |>
  show_query()
<SQL>
SELECT `postcode`, COUNT(*) AS `n`
FROM (
  SELECT *, word(`RegAddress.PostCode`, 1.0) AS `postcode`
  FROM `data_1e6`
  WHERE (`RegAddress.PostTown` = 'YORK')
)
GROUP BY `postcode`
ORDER BY `n` DESC
LIMIT 5
  • Running the query errors because word hasn't been translated into SQL yet
  • Solution: use the similar base function substr, which has been translated
  • This is more likely to happen the more niche a function is
tbl(con_sql, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(5) |>
  collect()
# A tibble: 5 × 2
  postcode     n
  <chr>    <int>
1 "YO30"     486
2 "YO19"     325
3 "YO26"     271
4 "YO1 "     241
5 "YO31"     180

duckdb: introduction

  • Designed for fast analytics (column-oriented) whereas SQLite is designed for transactions (row-oriented)
  • Very new: its first demo was in 2020 (SQLite's first release was in 2000)
  • Can read directly from CSV or has its own database files like SQLite
  • Use the same dbConnect() function but passing in a different driver
library(duckdb)
con_dd <- dbConnect(duckdb(), "data.duckdb")
dbWriteTable(con_dd, "data_1e6", df)

duckdb: usage

  • Duckdb uses the same SQL language, albeit with subtle differences in available functions
tbl(con_dd, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(50) |>
  show_query()
<SQL>
SELECT postcode, COUNT(*) AS n
FROM (
  SELECT *, SUBSTR("RegAddress.PostCode", 1, 4) AS postcode
  FROM data_1e6
  WHERE ("RegAddress.PostTown" = 'YORK')
) q01
GROUP BY postcode
ORDER BY n DESC
LIMIT 50
  • word is also not translated for duckdb, so we again use the substr version
  • Code is again identical to both SQLite and dtplyr
tbl(con_dd, "data_1e6") |> 
  filter(RegAddress.PostTown == 'YORK') |> 
  mutate(postcode = substr(RegAddress.PostCode, 1, 4)) |>
  count(postcode) |>
  arrange(desc(n)) |>
  head(50) |>
  collect()
# A tibble: 23 × 2
   postcode     n
   <chr>    <dbl>
 1 "YO30"     486
 2 "YO19"     325
 3 "YO26"     271
 4 "YO1 "     241
 5 "YO31"     180
 6 "YO24"     178
 7 "YO10"     172
 8 "YO32"     169
 9 "YO23"     161
10 "YO42"     142
# ℹ 13 more rows

Overall benchmark

  • data.table is the fastest! But it requires learning a new 'language'
  • All of the other options are still much faster than the tidyverse and let you use the same code
  • tidytable is my personal sweet spot between ease of use and performance gains
  • duckdb and sqlite are also useful when data storage is a concern:
    • CSV: 2.4GB
    • SQLite: 1.9GB
    • Duckdb: 500MB

Benchmark - all 5 million rows

Rcpp

Introduction

  • Sometimes for loops are necessary:
    • No inbuilt vectorised solution
    • Recurrent algorithm such as stepping through time or space
    • Performance-critical code that needs more specialised data structures
  • Rcpp, which combines R and C++, to the rescue!
  • C++ is compiled, which makes it very fast, but it also requires more effort both to write programs in and to interface with R
  • Rcpp makes this process easy by providing:
    • A C++ library that contains similar data structures and functions to R
    • An R package that compiles the C++ code and makes the resulting functions easily accessible within R

Basic example

  • Can use Rcpp::cppFunction() to write a C++ function as a string or Rcpp::sourceCpp() if it’s in a separate file
  • Both methods do the same:
    • Compile C++ code
    • Create an R function that calls it
  • C++ has many differences from R; having to give every variable an explicit type is probably the most notable
library(Rcpp)
cppFunction("
double sumRcpp(NumericVector x) {
  // The function definition syntax is:
  // RETURN_TYPE functionName(INPUT_TYPE input, ...)
  int n = x.size();  // R objects have their own type (NumericVector) with useful attributes
  double total = 0;  // Need to instantiate variables before use
  for (int i = 0; i < n; i++) {  // C++ indexes start at 0
    total += x[i];
  }
  return total;      // Need to explicitly return values
}")
# The sumRcpp function is instantly available within R
sumRcpp(c(1, 2, 3))
[1] 6

Syntactic sugar - data structures

  • When calling an Rcpp function, the inputs are automatically converted from their R type into the specified C++ type
  • NumericVector is a special Rcpp data structure that represents a vector of floats, so will accept both c(1.2, 2.4, 3.6), and c(1, 2, 3), but not c('a', 'b', 'c')
  • IntegerVector coerces floats into integers
  • CharacterVector also exists for strings, and NumericMatrix for 2D structures
  • C++ (like most languages but unlike R) differentiates between scalar and vectors
  • Standard C++ scalar data types can be used in Rcpp functions: int, double, char etc…
  • Can also use any other C++ data structure (e.g. STL or from libraries)
  • wrap() is an Rcpp function that converts back from C++ to R, useful when returning at the end of a function!
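As a minimal sketch of these conversions (illustrative, not from the original slides): a function that accumulates into a plain C++ std::vector and uses wrap() to hand it back to R.

```r
library(Rcpp)

cppFunction("
NumericVector cumsumRcpp(NumericVector x) {
  // Use a standard C++ STL container internally...
  std::vector<double> out(x.size());
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i];
    out[i] = total;
  }
  // ...and wrap() converts it back to an R numeric vector on return
  return wrap(out);
}")

cumsumRcpp(c(1, 2, 3))  # 1 3 6, same as cumsum(c(1, 2, 3))
```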

Syntactic sugar - functions

  • There’s no penalty to using for loops in C++ so they are very common
  • But to save typing boilerplate code, Rcpp provides ‘syntactic sugar’ functions that operate on the R-specific data types
  • Examples: mean, log, exp, sin, any, all
  • The for loop wasn’t necessary!
cppFunction("double sumRcppsugar(NumericVector x) {
  return sum(x);
}")
sumRcppsugar(c(1, 2, 3))
[1] 6

sum benchmarks

sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {  # seq_along() is safe even for empty vectors
    total <- total + x[i]
  }
  total
}
  • The Rcpp implementations are around 30x faster than the R version
  • The syntactic sugar version is the same speed as the for loop
  • The inbuilt sum is highly optimised
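Timings like these can be measured with the microbenchmark package; a minimal sketch (only the pure-R loop versus the inbuilt sum, since the Rcpp versions need compiling first):

```r
library(microbenchmark)

sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}

x <- runif(1e5)
# Both give the same answer; only the speed differs
microbenchmark(sumR(x), sum(x), times = 100)
```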

Real world example: Kalman Filter

kf_r <- function(y, m, Q=0.5, H=0.5) {
  n <- length(y)
  alpha <- array(NA, dim=c(n+1, m))
  P <- array(NA, dim=c(m, m, n+1))
  alpha[1, ] <- 0  # Initialise with zero mean and high variance
  P[, , 1] <- 1e3
  Z <- array(1, dim=c(n, m)) 
  for (i in 1:n) {
    Zi <- Z[i, , drop=FALSE]  # Keep as a 1 x m matrix so the algebra works for m > 1
    P_updated <- P[, , i] + Q
    # Calculate Kalman gain
    K <- P_updated %*% t(Zi) %*% solve(Zi %*% P_updated %*% t(Zi) + H)
    # Update state and covariance
    alpha[i+1, ] <- alpha[i, ] + K %*% (y[i] - Zi %*% alpha[i, ])
    P[, , i+1] <- (diag(m) - K %*% Zi) %*% P_updated
  }
  list(alpha=alpha, P=P)
}
  • The Kalman Filter is an algorithm that estimates unobserved parameters in a noisy system
  • It is recursive: the estimate at time \(t\) depends solely on the value at time \(t-1\), so it cannot be vectorised and is a good candidate for Rcpp
  • The R implementation is a straightforward series of matrix operations

Kalman Filter - Rcpp implementation

cppFunction("
NumericMatrix kf_rcpp(arma::vec y, int m, float Q=0.5, float H=0.5) {
  int n = y.n_rows;
  
  arma::mat alpha(n+1, m, arma::fill::none);
  arma::cube P(m, m, n+1, arma::fill::zeros);  // zeros, so the off-diagonals of the first slice are defined
  arma::mat Z(n, m, arma::fill::ones);
  
  // Initialise with zero mean and high variance
  alpha.row(0).fill(0);
  P.slice(0).diag().fill(1000);
  
  // Run filter
  arma::mat P_updated(m, m);
  arma::mat K(m, 1);
  for (int i=0; i<n; i++) {
    P_updated = P.slice(i) + Q;
    // Calculate Kalman gain
    K = P_updated * Z.row(i).t() * arma::inv(Z.row(i) * P_updated * Z.row(i).t() + H);
    // Update state and covariance
    alpha.row(i+1) = alpha.row(i) + (K * (y[i] - Z.row(i) * alpha.row(i).t())).t();
    P.slice(i+1) = (arma::eye(m, m) - K * Z.row(i)) * P_updated;
  }
  
  return wrap(alpha);  // This is crucial: wrap() converts the Armadillo matrix into an R matrix
}", depends="RcppArmadillo")
  • Rcpp implementation is very similar, only using the RcppArmadillo library for access to 3D arrays
  • Can then call kf_r() or kf_rcpp() identically

Benchmark

  • ~80x quicker in Rcpp!
  • This made it feasible to move from hourly to minute-level time resolution
  • It is a core library function, so it was worth investing the development time

Parallelisation / Viking

  • For iterative jobs that don’t fit the tabular-data mould, you can run parallelised for loops, e.g.:
    • Fitting models
    • Network downloads/uploads
    • File processing
  • For ‘small’ jobs can run locally using parallel::mclapply (Linux/macOS), doParallel and foreach (Windows), or furrr (all OSes, Tidyverse-style, combines future and purrr)
  • For larger jobs (both duration of each iteration and number of iterations), Viking is very useful with array jobs
  • Viking also useful to free up PC when running a computationally intensive library (e.g. Stan, INLA, Keras). Optimising these is application-specific
  • future.batchtools offers the ability to automatically create and submit Slurm job scripts from within R, letting you easily switch between running sequentially, with local multi-core parallelisation, or as Slurm array jobs of independent processes. UNTESTED
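A minimal local sketch with the base parallel package (slow_task here is a stand-in for real work such as fitting one model; the core count is illustrative):

```r
library(parallel)

# Stand-in for an expensive per-item job (e.g. fitting one model)
slow_task <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# mclapply forks the R process, so it only parallelises on Linux/macOS;
# on Windows it must run with mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1 else 2
results <- mclapply(1:8, slow_task, mc.cores = n_cores)
unlist(results)
```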

Thoughts

  • I tend to use all of these strategies with varying frequency:
    • Using inbuilt vectorised functions - daily
    • data.table in some form - weekly
    • Parallel jobs (Viking or local) - monthly
    • Rcpp - couple of times a year
  • As well as speeding up interactive work, faster code is especially useful when developing packages
  • As a result of researching for this talk, I’m going to use duckdb in future for some large datasets

Resources

Misc

Worked example - regex

  • The basename and dirname solution was faster than regex
example_4 <- function(paths) {
  gsub(".+\\/+([[:alnum:]]+)\\/([[:alnum:]]+\\.[[:alnum:]]+)$", "\\1_\\2", paths)
}
example_4(c("foo/bar/car/001.txt", "har/far/lar/002.txt"))
[1] "car_001.txt" "lar_002.txt"
                         expr    median
1 example_1_vectorised(paths) 6973.2335
2    sapply(paths, example_1) 6793.3205
3            example_2(paths) 1232.2500
4            example_3(paths)  125.4765
5            example_4(paths)  387.4380

Case when

  • No speed difference between ifelse and case_when
f_casewhen <- function() {
  df_interval |>
    mutate(week_group = case_when(
      time >= as_date("2020-02-14") & time < as_date("2020-02-21") ~ 'a',
      time >= as_date("2020-03-17") & time < as_date("2020-03-24") ~ 'b',
      time >= as_date("2020-05-08") & time < as_date("2020-05-15") ~ 'c',
      time >= as_date("2020-09-20") & time < as_date("2020-09-27") ~ 'd',
      time >= as_date("2020-11-13") & time < as_date("2020-11-20") ~ 'e',
      .default = NA_character_
    )) |>
    filter(!is.na(week_group))
}

res <- microbenchmark(f_intervaljoin(), f_ifelse(), f_casewhen(), times=10)
summary(res)[c("expr", "median")]
              expr    median
1 f_intervaljoin()  1.930972
2       f_ifelse() 22.516736
3     f_casewhen() 22.306646
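f_intervaljoin() itself isn’t shown on the slide; a sketch of what it could look like using dplyr 1.1’s non-equi join_by() (the week boundaries match the case_when above; df_interval here is a toy stand-in):

```r
library(dplyr)
library(lubridate)

# Toy stand-in for df_interval: one row per day of 2020
df_interval <- tibble(time = as_date("2020-01-01") + 0:365)

# Lookup table of week boundaries, one row per group
weeks <- tibble(
  start = as_date(c("2020-02-14", "2020-03-17", "2020-05-08",
                    "2020-09-20", "2020-11-13")),
  end = start + days(7),
  week_group = c("a", "b", "c", "d", "e")
)

# A single non-equi join replaces the five case_when branches
f_intervaljoin <- function() {
  df_interval |>
    inner_join(weeks, by = join_by(time >= start, time < end))
}

nrow(f_intervaljoin())  # 35: 7 days in each of the 5 weeks
```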

Filter vs inner join speed

  • When limiting the analysis to the classifications with at least 10 companies, it was quicker to reduce the main dataset with an inner_join than with filter
            expr   median
1     f_filter() 898.7851
2 f_inner_join() 800.7263
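A sketch of the two strategies on toy data (the real dataset and its classification column differ):

```r
library(dplyr)

# Toy stand-in: two common classifications and one rare one
df <- tibble(classification = rep(c("a", "b", "c"), times = c(5000, 4000, 5)))

# Classifications with at least 10 companies
keep <- df |>
  count(classification) |>
  filter(n >= 10)

# Option 1: filter the main table directly
f_filter <- function() {
  df |> filter(classification %in% keep$classification)
}

# Option 2: inner join against the small lookup table
f_inner_join <- function() {
  df |> inner_join(keep["classification"], by = "classification")
}
```

Both return the same rows; the benchmark above timed them with microbenchmark on the full dataset.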